Multiple Regression Assumptions and Diagnostics

Statistics for Data Science II

Introduction

  • Let us now review the “checks” we will perform on our models.

    • Model assumptions

    • Outliers

    • Influence

    • Leverage

    • Multicollinearity

Model Assumptions

  • Recall the general linear model, y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon, where \varepsilon is the random error term.

  • Recall that the residual is defined as e = y - \hat{y} where

    • y is the observed value

    • \hat{y} is the predicted value

  • The residual tells us how far away the observation is from the response surface (the predicted value).

  • Ordinary least squares estimation means that we have minimized the overall error: the sum of squared residuals.

Model Assumptions

  • Recall the regression (and ANOVA) assumptions,

\varepsilon \overset{\text{iid}}{\sim} N(0, \sigma^2)

  • The assumptions are on the residual!

  • What this means:

    • Residuals are normally distributed

    • Distribution of residuals is centered at 0

    • Variance of residuals is some constant \sigma^2

Model Assumptions

  • We will assess the assumptions graphically.

    • Constant variance: scatterplot

    • Normal distribution: q-q plot

  • A former student wrote an R package for this course, classpackage.

    • If you are working on the server, the package is already installed.

    • If you are not working on the server, you need to install the package:

# install.packages("devtools") - use this if R tells you it can't find install_github()
library(devtools)
install_github("ieb2/class_package", force = TRUE)

Model Assumptions

  • Once installed, we call the package,
library(classpackage)
  • While there are several functions in this package, we are interested in the anova_check() function.
m <- glm(y ~ x1 + x2 + ..., data = dataset)
anova_check(m)
  • This will provide the scatterplot and the q-q plot.

Data about Bee Colonies

library(tidyverse)
library(fastDummies)
colony <- read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-01-11/colony.csv') %>%
  mutate(quarter = case_when(months == "January-March" ~ "Q1", 
                             months == "April-June" ~ "Q2",
                             months == "July-September" ~ "Q3",
                             months == "October-December" ~ "Q4")) %>%
  dummy_cols(select_columns = c("quarter")) %>%
  select(quarter, year, state, colony_lost, quarter_Q1, quarter_Q2, quarter_Q3, quarter_Q4, colony_max)
head(colony, n=5)

Model Assumptions

  • Further, recall the model we constructed,
m1 <- glm(colony_lost ~ quarter_Q1 + quarter_Q2 + quarter_Q3 + year + colony_max, data = colony)
summary(m1)

Call:
glm(formula = colony_lost ~ quarter_Q1 + quarter_Q2 + quarter_Q3 + 
    year + colony_max, data = colony)

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  4.274e+05  2.203e+05   1.940   0.0526 .  
quarter_Q1   5.622e+02  5.785e+02   0.972   0.3314    
quarter_Q2  -3.496e+03  5.987e+02  -5.839 6.84e-09 ***
quarter_Q3   5.952e+02  5.977e+02   0.996   0.3196    
year        -2.119e+02  1.092e+02  -1.941   0.0525 .  
colony_max   1.167e-01  1.086e-03 107.466  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 49277184)

    Null deviance: 6.2772e+11  on 1149  degrees of freedom
Residual deviance: 5.6373e+10  on 1144  degrees of freedom
  (72 observations deleted due to missingness)
AIC: 23641

Number of Fisher Scoring iterations: 2

Model Assumptions

  • Now, let’s check the assumptions on the model,
library(classpackage)
anova_check(m1)

Model Fit: R^2

  • If we want to know how well the model fits the particular dataset at hand, we can look at the R^2.

    • R^2 is the proportion of variance explained by the model.

    • Because it is a proportion, R^2 \in [0, 1] and is independent of the units of y.

  • If R^2 \to 0, the model does not fit the data well; if R^2 \to 1, the model fits the data well.

    • Note that if R^2=1, all observations fall on the response surface.

R^2 = \frac{\text{SSReg}}{\text{SSTot}}
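As a language-agnostic check of this formula, here is a small Python sketch with made-up observed and fitted values (not course code, which is in R):

```python
# Sketch of the R^2 decomposition with made-up observed and fitted values.
y = [1.0, 2.0, 3.0, 4.0]          # observed responses (toy data)
y_hat = [1.1, 1.9, 3.2, 3.8]      # fitted values from some hypothetical model

y_bar = sum(y) / len(y)
ss_tot = sum((yi - y_bar) ** 2 for yi in y)                # total variability (SSTot)
ss_err = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))   # unexplained variability (SSE)
ss_reg = ss_tot - ss_err                                   # explained variability (SSReg)

r2 = ss_reg / ss_tot
print(round(r2, 2))  # 0.98
```

Note that R^2 = SSReg/SSTot = 1 - SSE/SSTot, since SSTot = SSReg + SSE.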

Model Fit: R^2

  • Remember that we are partitioning the variability in y (SSTot), which is constant, into two pieces:

    • The variability explained by the regression model (SSReg).

    • The variability explained by outside sources (SSE).

  • As predictors are added to the model, SSReg necessarily increases (and SSE decreases), so R^2 never goes down, even when a new predictor adds no real information.

  • We want a measure of model fit that is resistant to this artificial inflation, the adjusted R^2: R^2_{\text{adj}} = 1 - \left( \frac{n-1}{n-k-1} \right) \left( 1 - R^2 \right), where n is the number of observations and k is the number of predictor terms in the model.
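To see the penalty in action, here is a quick Python sketch of the adjusted R^2 formula with made-up numbers (the course itself uses R):

```python
def adj_r2(r2, n, k):
    """Adjusted R^2: penalizes R^2 for the number of predictors k given n observations."""
    return 1 - ((n - 1) / (n - k - 1)) * (1 - r2)

# Made-up example: a model with R^2 = 0.90 fit on n = 100 rows using k = 5 predictors.
print(round(adj_r2(0.90, 100, 5), 4))  # 0.8947
```

The adjustment shrinks R^2 slightly; the gap widens as k grows relative to n.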

Model Fit: R^2

  • We will use the adjR2() function from the glmtoolbox package to find R^2_{\text{adj}}.

  • In our example,

library(glmtoolbox)
(adjR2(m1))
[1] 0.9098

Model Diagnostics: Outliers

  • Definition: data values that are much larger or smaller than the rest of the values in the dataset.

  • We will look at the standardized residuals, e_{i_{\text{standardized}}} = \frac{e_i}{\sqrt{\text{MSE}(1-h_i)}}, where

    • e_i = y_i - \hat{y}_i is the residual of the ith observation
    • h_i is the leverage of the ith observation
  • If |e_{i_{\text{standardized}}}| > 2.5 \ \to \ outlier.

  • If |e_{i_{\text{standardized}}}| > 3 \ \to \ extreme outlier.
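As a numeric check of the standardization formula, here is a Python sketch with made-up values for the residual, MSE, and leverage:

```python
import math

def standardized_residual(e, mse, h):
    """e / sqrt(MSE * (1 - h)): raw residual scaled by its estimated standard deviation."""
    return e / math.sqrt(mse * (1 - h))

# Made-up values: raw residual 10, MSE = 25, leverage h = 0.2.
z = standardized_residual(10, 25, 0.2)
print(round(z, 3))   # 2.236
print(abs(z) > 2.5)  # False -> not flagged as an outlier
```

In R, rstandard() performs this same scaling for every observation at once.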

Model Diagnostics: Outliers

  • We will use the rstandard() function to find the residuals.

  • For ease of examining in large datasets, we will use it to create a “flag.”

dataset <- dataset %>%
  mutate(outlier =  if_else(abs(rstandard(m))>2.5, "Suspected", "Not Suspected"))
  • We can count the number of outliers,
dataset %>% count(outlier)
  • We can just look at outliers from the dataset,
new_dataset <- dataset %>% 
  filter(outlier == "Suspected")
  • We can also exclude outliers from the dataset,
new_dataset <- dataset %>% 
  filter(outlier == "Not Suspected")

Model Diagnostics: Outliers

  • Let’s check for outliers in our example data,
colony <- colony %>% 
  mutate(outlier =  if_else(abs(rstandard(m1))>2.5, "Suspected", "Not Suspected"))
Error in `mutate()`:
ℹ In argument: `outlier = if_else(abs(rstandard(m1)) > 2.5, "Suspected",
  "Not Suspected")`.
Caused by error:
! `outlier` must be size 1222 or 1, not 1150.
  • Oops! Because of missing data, m1 was fit on only 1150 rows, so rstandard(m1) does not line up with the 1222 rows of colony.

Model Diagnostics: Outliers

  • Let’s check for outliers in our example data,
summary(m1)

Call:
glm(formula = colony_lost ~ quarter_Q1 + quarter_Q2 + quarter_Q3 + 
    year + colony_max, data = colony)

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  4.274e+05  2.203e+05   1.940   0.0526 .  
quarter_Q1   5.622e+02  5.785e+02   0.972   0.3314    
quarter_Q2  -3.496e+03  5.987e+02  -5.839 6.84e-09 ***
quarter_Q3   5.952e+02  5.977e+02   0.996   0.3196    
year        -2.119e+02  1.092e+02  -1.941   0.0525 .  
colony_max   1.167e-01  1.086e-03 107.466  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 49277184)

    Null deviance: 6.2772e+11  on 1149  degrees of freedom
Residual deviance: 5.6373e+10  on 1144  degrees of freedom
  (72 observations deleted due to missingness)
AIC: 23641

Number of Fisher Scoring iterations: 2
colony2 <- colony %>%
  select(colony_lost, quarter_Q1, quarter_Q2, quarter_Q3, year, colony_max) %>%
  na.omit()
m2 <- glm(colony_lost ~ quarter_Q1 + quarter_Q2 + quarter_Q3 + year + colony_max, data = colony2)
summary(m2)

Call:
glm(formula = colony_lost ~ quarter_Q1 + quarter_Q2 + quarter_Q3 + 
    year + colony_max, data = colony2)

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  4.274e+05  2.203e+05   1.940   0.0526 .  
quarter_Q1   5.622e+02  5.785e+02   0.972   0.3314    
quarter_Q2  -3.496e+03  5.987e+02  -5.839 6.84e-09 ***
quarter_Q3   5.952e+02  5.977e+02   0.996   0.3196    
year        -2.119e+02  1.092e+02  -1.941   0.0525 .  
colony_max   1.167e-01  1.086e-03 107.466  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 49277184)

    Null deviance: 6.2772e+11  on 1149  degrees of freedom
Residual deviance: 5.6373e+10  on 1144  degrees of freedom
AIC: 23641

Number of Fisher Scoring iterations: 2
colony2 <- colony2 %>% 
  mutate(outlier =  if_else(abs(rstandard(m2))>2.5, "Suspected", "Not Suspected"))
head(colony2, n = 3)

Model Diagnostics: Outliers

  • Let’s investigate further.
  • How many data points are outliers?
colony2 %>% count(outlier)
  • There are 32 suspected outliers (observations with |standardized residual| > 2.5).

colony2 %>% ggplot(aes(x = colony_max, y = colony_lost, color = outlier)) +
  geom_point() + 
  scale_color_manual(values = c("#999999", "#000000")) +
  labs(x = "Total Colonies", y = "Colonies Lost", color = "Outlier") +
  theme_bw()

Model Diagnostics: Leverage and Influential Points

  • A leverage point is defined as follows:

    • A point for which x_i is far from the other values.
  • An influential point is defined as follows:

    • A point that significantly affects the regression model.
  • We check these together using Cook’s distance.

    • We will look for “spikes” in the plot.
  • We will use the cooks() function from the classpackage package.

cooks(m)
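Under the hood, Cook's distance for an observation combines its residual and its leverage. A minimal Python sketch of the standard formula, using made-up values:

```python
def cooks_distance(e, mse, h, p):
    """Cook's D for one observation: raw residual e, model MSE, leverage h,
    and p = number of estimated coefficients (intercept included)."""
    return (e ** 2 / (p * mse)) * (h / (1 - h) ** 2)

# Made-up values: residual 10, MSE = 25, leverage 0.2, p = 6 coefficients.
print(round(cooks_distance(10, 25, 0.2, 6), 4))  # 0.2083
```

A large residual or a large leverage (or both) drives D up, which is why the "spikes" in the plot capture leverage and influence together.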

Model Diagnostics: Leverage and Influential Points

  • Let’s assess leverage and influence in our example.
cooks(m2)

  • I might flag observation 4 for further investigation.

Model Diagnostics: Multicollinearity

  • Collinearity/multicollinearity: correlation between two or more predictor variables that affects the estimation procedure.

  • We will use the variance inflation factor (VIF) to check for multicollinearity. \text{VIF}_j = \frac{1}{1-R^2_j},

  • where

    • j = the predictor of interest and j \in \{1, 2, ..., k \},
    • R^2_j results from regressing x_j on the remaining (k-1) predictors.
  • We say that multicollinearity is present if VIF > 10.

  • How do we deal with multicollinearity?

    • Easy answer: remove at least one predictor from the collinear set, then reassess VIF.

    • More complicated answer: discuss with collaborators/bosses which predictor(s) to drop.
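The VIF definition above can be verified directly: regress one predictor on the rest, take the resulting R^2_j, and invert. A Python/NumPy sketch with a made-up design matrix of two orthogonal (uncorrelated) predictors, where each VIF should be exactly 1:

```python
import numpy as np

def vif(X, j):
    """VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing column j on the others."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    design = np.column_stack([np.ones(len(y)), others])  # add an intercept column
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    r2_j = 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
    return 1 / (1 - r2_j)

# Two centered, orthogonal columns: neither predicts the other, so VIF = 1.
X = np.array([[-2.0, 2.0], [-1.0, -1.0], [0.0, -2.0], [1.0, -1.0], [2.0, 2.0]])
print(round(vif(X, 0), 6))  # 1.0
```

As the predictors become more correlated, R^2_j approaches 1 and the VIF blows up.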

Model Diagnostics: Multicollinearity

  • We will use the vif() function from the car package.
library(car)
vif(m)
  • Note: the car package overwrites some functions from tidyverse, so I typically do not load the full library.

  • There will be a VIF value for each predictor in the model.

    • Note that VIF can sometimes be high for related categorical terms.

Model Diagnostics: Multicollinearity

  • Let’s check the multicollinearity for our data,
car::vif(m2)
quarter_Q1 quarter_Q2 quarter_Q3       year colony_max 
  1.574555   1.525640   1.520629   1.013434   1.000756 
  • All VIF values are well below 10, so no multicollinearity is present.

Sensitivity Analysis

  • We can perform sensitivity analysis to determine how much our model changes when we exclude the outliers.

    • Model 1: model using data with all observations
    • Model 2: model using data without identified outliers
  • Questions we will ask:

    • How different are the \hat{\beta}_i?
    • Did a predictor go from being significant to non-significant? (or vice-versa?)
    • Does the direction of \hat{\beta}_i change?
    • What is the difference in R^2?
  • We only look at sensitivity analysis once (i.e., only remove data points once for reanalysis).

    • If we keep going, we will whittle down the dataset to as close to a “perfect fit” as possible, introducing other issues.

Sensitivity Analysis

  • Let’s perform sensitivity analysis on our example model.
colony_outliers <- colony2 %>% filter(outlier == "Suspected")
head(colony_outliers, n=5)
m2 <- glm(colony_lost ~ quarter_Q1 + quarter_Q2 + quarter_Q3 + year + colony_max, data = colony2)
summary(m2)

Call:
glm(formula = colony_lost ~ quarter_Q1 + quarter_Q2 + quarter_Q3 + 
    year + colony_max, data = colony2)

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  4.274e+05  2.203e+05   1.940   0.0526 .  
quarter_Q1   5.622e+02  5.785e+02   0.972   0.3314    
quarter_Q2  -3.496e+03  5.987e+02  -5.839 6.84e-09 ***
quarter_Q3   5.952e+02  5.977e+02   0.996   0.3196    
year        -2.119e+02  1.092e+02  -1.941   0.0525 .  
colony_max   1.167e-01  1.086e-03 107.466  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 49277184)

    Null deviance: 6.2772e+11  on 1149  degrees of freedom
Residual deviance: 5.6373e+10  on 1144  degrees of freedom
AIC: 23641

Number of Fisher Scoring iterations: 2
colony_no_outliers <- colony2 %>% filter(outlier == "Not Suspected")
m3 <- glm(colony_lost ~ quarter_Q1 + quarter_Q2 + quarter_Q3 + year + colony_max, data = colony_no_outliers)
summary(m3)

Call:
glm(formula = colony_lost ~ quarter_Q1 + quarter_Q2 + quarter_Q3 + 
    year + colony_max, data = colony_no_outliers)

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.192e+05  1.117e+05   1.963   0.0499 *  
quarter_Q1  -8.762e+01  2.929e+02  -0.299   0.7649    
quarter_Q2  -1.810e+03  3.051e+02  -5.933 3.97e-09 ***
quarter_Q3   2.071e+02  3.026e+02   0.684   0.4938    
year        -1.087e+02  5.535e+01  -1.963   0.0499 *  
colony_max   1.121e-01  8.059e-04 139.035  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 12331684)

    Null deviance: 2.5457e+11  on 1117  degrees of freedom
Residual deviance: 1.3713e+10  on 1112  degrees of freedom
AIC: 21435

Number of Fisher Scoring iterations: 2
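To make the comparison concrete, here is a quick arithmetic sketch (in Python) of how much three of the coefficients moved, with the estimates hardcoded from the two summaries above:

```python
# Coefficient estimates copied from the summaries above (m2 = all rows, m3 = outliers removed).
with_outliers    = {"quarter_Q2": -3.496e3, "year": -2.119e2, "colony_max": 1.167e-1}
without_outliers = {"quarter_Q2": -1.810e3, "year": -1.087e2, "colony_max": 1.121e-1}

for name in with_outliers:
    old, new = with_outliers[name], without_outliers[name]
    pct = 100 * (new - old) / abs(old)   # signed percent change after removing outliers
    print(f"{name}: {pct:+.1f}%")
```

The quarter_Q2 and year coefficients shrink toward zero by nearly half, while colony_max barely moves; also note that year crosses from marginal (p = 0.0525) to significant (p = 0.0499), so we should report both models and discuss the discrepancy.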

Wrap Up

  • Today we have covered the basics of

    • model assumptions

    • model diagnostics

    • sensitivity analysis

  • Next week, we will learn about a new type of term: an interaction.

    • Continuous \times continuous

    • Continuous \times categorical

    • Categorical \times categorical